Wire tier-2b Langfuse Generation fixtures#186
Merged
Conversation
Move 023 (generation rendering + payload truncation) and 024 (prompt linkage) from _UNIT_TESTED_FIXTURES into _SUPPORTED_FIXTURES, driven through a LangfuseObserver + InMemoryLangfuseClient recorder. Completes the Langfuse tier of the fixture-harness catch-up; test-only, no library change, no pin bump. Adds a generation runner that asserts the Generation observation (model / modelParameters / usage / input-output payload + prompt-entity link) nested under the node span, plus content_repeat synthesis + payload_byte_cap truncation (023) and a prompt-backend Langfuse reference carried via PromptResult.observability_entities (024). The value-matcher gained nested-dict recursion for metadata.prompt. No deferrals.
There was a problem hiding this comment.
Pull request overview
This PR completes the YAML conformance-harness wiring for the Tier 2b Langfuse “Generation” fixtures (023/024), moving them from unit-tested-only coverage into the main fixture runner. The changes are test-only and extend the harness to validate Langfuse Generation rendering, truncation behavior, and prompt-entity linkage per the spec mapping.
Changes:
- Added a new Langfuse Generation fixture driver (
_run_langfuse_generation_fixture) and Generation-field assertions integrated into the Langfuse observation-tree matcher. - Extended the Langfuse value matcher to recurse into nested mappings so placeholder tokens can match inside nested objects (needed for fixture 024).
- Added
content_repeatsynthesis for typed messages and carried Langfuse prompt references viaPromptResult.observability_entities.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Address review feedback on PR #186: the input_is_raw_string_with_marker check matched a bare "[truncated" substring, which could false-positive on arbitrary content. Tighten it to a regex matching the full marker shape, matching the observer's _TRUNCATION_MARKER_TEMPLATE and consistent with the OTel marker_pattern approach.
Fold in the python-side nuances from spec's Tier 2 review: - _assert_langfuse_observation_tree now disambiguates same-(type, name) sibling observations (032's per-instance "process" spans) by their scalar metadata rather than emission order, so the assertions can't bind the wrong sibling if the observer's emission order shifts. - _run_invocation_id_case now asserts the fixture's top-level verbatim invocation_id clause (035/036) against the in-memory recorder's raw trace.id, so it isn't half-asserted across the OTel and Langfuse runners.
Comment on lines
+2658
to
+2662
| # A regular NON-empty nested mapping (e.g. 024 metadata.prompt): recurse per | ||
| # key so inner tokens (rendered_hash: <any-string>) still apply. Subset over | ||
| # keys -- every expected key must be present and match; actual MAY carry | ||
| # extras. An empty expected dict falls through to exact equality below | ||
| # (rather than vacuously matching any mapping). |
Comment on lines
+2871
to
+2874
| graph, state_cls, provider = _build_simple_llm_graph(case, populate_caller_metadata=False) | ||
| client = InMemoryLangfuseClient() | ||
| cfg = cast("dict[str, Any]", case.get("langfuse_observer") or {}) | ||
| lf_kwargs: dict[str, Any] = {"client": client} |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Completes the Langfuse tier of the conformance-harness fixture catch-up: wire
the two Langfuse Generation fixtures into the YAML harness. Test-only, no
library change, no pin bump.
Wired (2)
Moved from
_UNIT_TESTED_FIXTURESto_SUPPORTED_FIXTURES:metadata) plus the payload-truncation fallthrough (input becomes the raw
marker-bearing string once it exceeds the byte cap).
Langfuse Prompt reference) and the absent case.
Harness machinery added
_run_langfuse_generation_fixturebuilds a calls_llm graph, records into anInMemoryLangfuseClientunder the fixture'sdisable_provider_payload/payload_byte_capconfig, and asserts the Generation observation nested underthe node span.
_assert_langfuse_generation_fieldscovers model / modelParameters / usage /prompt_entity_link and the two input shapes (native message list under the
cap, raw truncated string with the marker over it). The placeholder-capable
fields run through the value matcher.
metadata.prompt(with an inner
<any-string>rendered_hash) matches._materialize_typed_messagesgainedcontent_repeatsynthesis (023), and_render_prompt_resultcarries a backend's Langfuse prompt reference intoPromptResult.observability_entities, which the observer resolves into theGeneration's prompt-entity link (024).
Testing
tests/conformance/test_observability.py: 72 passed, 40 skipped.tests/: 1464 passed, 406 skipped.